Composing Structured and Text Databases
Authors
Abstract
We postulate a universe of objects in which each object is described by a set of characteristics. Different objects can have the same or different characteristics, but they differ in at least one characteristic. Thus, every object can be conceptually thought to have an implicit unique identity, though the object identity is not manifest. We have two data sources. The first is a collection of structured records, each record containing representative but partial characteristics of some object. The second contains text documents written in a natural language, each discussing some aspects of a small number of objects. There is no marking or structure in the documents that explicitly identifies the objects that the document is about; each text is simply a sequence of words. Under this general setting, we present a framework for composing structured information about the objects with the textual information about them. The framework is centered around the concept of “trait”: a set of characteristics that can serve as the proxy for the identity of an object. Traits might sound similar to database keys, but traits are instance-based rather than schema-based. We present techniques for computing traits, mapping structured records and text documents to traits, and thus joining information about the same object from two repositories. Our extensive experiments using synthetic data demonstrate the effectiveness of our approach under a wide range of operating parameters. Experiments using empirical data validate the results of the synthetic experiments.
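The trait idea described above can be illustrated with a small sketch. This is not the paper's algorithm, which the abstract does not spell out; it is a brute-force illustration under stated assumptions: records are flat dictionaries of single-word string values, a trait is taken to be a minimal set of attribute-value pairs that occurs in exactly one record of the given instance, and a document matches a trait when every value in the trait appears among its words. The names `compute_traits`, `match_documents`, and the sample data are hypothetical.

```python
from itertools import combinations

# Hypothetical sample data: three structured records and two free-text documents.
records = {
    "r1": {"brand": "acme", "model": "x100", "color": "red"},
    "r2": {"brand": "acme", "model": "x200", "color": "red"},
    "r3": {"brand": "zenit", "model": "x100", "color": "blue"},
}
documents = {
    "d1": "The red x100 arrived quickly",
    "d2": "I compared the x200 with a blue one",
}


def compute_traits(records, max_size=2):
    """Return {record_id: [trait, ...]}, where each trait is a minimal
    frozenset of (attribute, value) pairs found in exactly one record.
    Assumption: uniqueness is judged against this instance only."""
    items = {rid: frozenset(rec.items()) for rid, rec in records.items()}
    traits = {rid: [] for rid in records}
    for rid, own in items.items():
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(own), size):
                cand = frozenset(combo)
                # Skip candidates that contain an already-found smaller trait.
                if any(t <= cand for t in traits[rid]):
                    continue
                # Keep the candidate only if no other record contains it.
                if all(not cand <= other
                       for oid, other in items.items() if oid != rid):
                    traits[rid].append(cand)
    return traits


def match_documents(traits, documents):
    """Map each document id to the record ids having at least one trait
    whose values all appear among the document's whitespace-split words."""
    matches = {}
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        matches[doc_id] = {
            rid for rid, ts in traits.items()
            if any(all(v.lower() in words for _, v in t) for t in ts)
        }
    return matches


traits = compute_traits(records)
joined = match_documents(traits, documents)
```

Here no single characteristic identifies `r1`, so its traits are pairs such as brand plus model, while `r2` is identified by its model alone. Because the traits depend on which records are present, adding a record can invalidate them, which is the instance-based (rather than schema-based) behaviour the abstract contrasts with database keys.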
Similar resources
Writers on the Move: Visualizing Composing Processes Involved in Academic Writing
The present research study aimed to explore covert processes of editing and revision which were involved in writing four different academic text genres (i.e. abstract, conclusion, data commentary, and cover letter) in the English language. To this end, six EFL learners with Persian as their mother tongue were recruited to participate in this study. All the participants attended an induction session and ea...
Query by Templates: A Generalized Approach for Visual Query Formulation for Text Dominated Databases
The WWW has a great potential of evolving into a globally distributed digital document library. The primary use of such a library is to retrieve information quickly and easily. Because of the size of these libraries, simple keyword searches often result in too many matches. More complex searches involving boolean expressions are difficult to formulate and understand. This paper describes QBT (Quer...
Investigation on Full-Text Databases Cited in LIS
Background and Aim: The main objective of this research was to investigate the use of full-text databases in the LIS theses of Tehran State Universities within the years 2005 and 2009. Method: For this purpose, the total of 9952 citations related to 172 existing theses in the academic central libraries were studied. The data collected were analyzed by the bibliometrics and citation analysis met...
Structured Text Retrieval Models
Structured text retrieval models provide a formal definition or mathematical framework for querying semistructured textual databases. A textual database contains both content and structure. The content is the text itself, and the structure divides the database into separate textual parts and relates those textual parts by some criterion. Often, textual databases can be represented as marked up ...
Combining data integration and information extraction
Improving the ability of computer systems to process text is a significant research challenge. Many applications are based on partially structured databases, where structured data conforming to a schema is combined with free text. Information is stored as text in these applications because the queries required...